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Abstract 

Fault tolerant algorithms for the numerical approximation of elliptic 
partial differential equations on modern supercomputers play a more and 
more important role in the future design of exa-scale enabled iterative 
solvers. Here, we combine domain partitioning with highly scalable geo¬ 
metric multigrid schemes to obtain fast and fault-robust solvers in three 
dimensions. The recovery strategy is based on a hierarchical hybrid con¬ 
cept where the values on lower dimensional primitives such as faces are 
stored redundantly and thus can be recovered easily in case of a failure. 
The lost volume unknowns in the faulty region are re-computed approx¬ 
imately with multigrid cycles by solving a local Dirichlet problem on the 
faulty subdomain. Different strategies are compared and evaluated with 
respect to performance, computational cost, and speed up. Especially 
effective are strategies in which the local recovery in the faulty region 
is executed in parallel with global solves and when the local recovery is 
additionally accelerated. This results in an asynchronous multigrid itera¬ 
tion that can fully compensate faults. Excellent parallel performance on 
a current peta-scale system is demonstrated. 


Keywords: fault tolerant algorithms, highly scalable multigrid, massive 
parallel and asynchron solvers 
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1 Introduction 

Future high performance systems will be characterized by millions of compute 
nodes that are executing up to a billion of parallel threads. This compute power 

^Institute for Numerical Mathematics (M2), Technische Universitat Miinchen, Boltz- 
mannstrasse 3, D—85748 Garching b. Miinchen, Germany 

^Department of Computer Science 10, FAU Erlangen-Niirnberg, Cauerstrafie 6, D-91058 
Erlangen, Germany 


1 



Resilience for Multigrid Software 


2 


will be extremely expensive not only with respect to acquisition costs but also 
due to the operational costs, whereby the energy consumption is becoming a 
major concern. The increasing system size results in a higher probability of 
failure of the components of the HPC-system [TS], and thus fail-safe performance 
is one of the new challenges in extreme scale computing. 

Faults can be classihed in fail-stop and fail-continue, also called hard and soft 
errors, respectively, see, e.g., [mnii. In the first case, the process stops, e.g., 
due to a permanent node crash or incorrect execution path which interrupts 
the program and results in a loss of the state of the process. In the case of 
soft errors, the process continues but the failure affects the execution through 
“bit-flips” (transient errors). Fault tolerance techniques can be categorized in 
hardware-based fault tolerance (HBFT) [311112] j system software-based fault 
tolerance (SBFT) [8l[Tni[12l|26l|52, and algorithm-based fault tolerance (ABFT) 
[Tbl [Ill [271 |35l [38]. For a general overview and a classihcation, we refer to 

miniiiH]. 

Achieving resilience is costly, since it always requires some form of redun¬ 
dancy, and thus a duplication of system resources and extra energy. In particu¬ 
lar, traditional checkpoint strategies must collect and transfer the data regularly 
from all compute nodes and store the data to backup memory [SlllMlIlIlIlZl- In 
large systems, this may be too expensive and slow. Consequently, algorithmic 
alternatives are required. It is only natural that the most efficient resilience 
techniques will have to exploit specific features of the algorithms. 

Under the assumption that the failure can be detected, such ABFT strategies 
implement the resilience in the algorithm itself and thus guarantee reliable re¬ 
sults. Originally, ABFT was proposed by Huang and Abraham [35] for systolic 
arrays where checksums monitor the data and are used for a reconstruction. 
Later on, it was extended to applications in linear algebra such as addition, 
matrix operations, scalar product, LU-decomposition, transposition and in fast 
Fourier transformation [a[iii[2ii[32]. The work by Davies and Chen [25] deals 
with fault detection and correction during the calculation for dense matrix oper¬ 
ations. For sparse matrix iterative solvers, such as SOR, GMRES, CG-iterations 
the previously mentioned approaches are not suitable due to a possible high over¬ 
head [IH] and were extended by [T] [TS] HOI IHl 110] . In [^ , Cui et al. exploit the 
structure of a parallel subspace correction method such that the subspaces are 
redundant on different processors, and the workload is efficiently balanced. 

At this time, the immediate detection and the replacement of the faulty com¬ 
ponents is not part of the standard message passing interface (MPI), however, 
fault tolerant MPI versions, such as Harness FT-MPI [35] or ULFM [5] [0] are 
under development. They do also not yet provide an instant reporting of the 
failure, but may do so in future versions when the hardware itself is extended 
to better support such features. For the purposes of this paper, we assume that 
such extensions are already available. 

In this article, we thus consider fail-stop errors as they may occur in itera¬ 
tive schemes when solving discretized elliptic partial differential equations. We 
focus on the design of fault tolerant parallel geometric multigrid methods since 
geometric multigrid methods are well-known for their asymptotic optimal com- 
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plexity and excellent parallel efficiency [H [Ml [311 [331 EOl EH ■ The influence 
of soft faults on an algebraic multigrid solver was studied in m- Similar to 
[23] . we pursue fault tolerance strategies for hard faults that 

• converge when a fault occurs, assuming it is detectable, 

• minimize the delay in the solution process, 

• minimize computational and communication overhead. 

In order to compensate for a hard fault, we decouple the faulty part from 
the intact part and exploit the fact that only relatively little of the data must 
be stored redundantly so that a special recovery process can reconstruct the 
bulk of the missing data efficiently. Furthermore, we find that the redundant 
data as it is required for the reconstruction of lost data is readily available in 
suitably designed data structures. These data structures can be found in many 
distributed memory parallel multigrid solvers that use a domain partitioning 
and ghost nodes for communicating. 

Our design goal is to minimize the time delay in an iterative solver that seems 
inevitable when a fault has destroyed part of its internal state. Here, however, 
we will show that the flexibility of modern heterogeneous architectures can be 
used beneficially. If the recovery is executed with powerful enough resources, 
the loss can in many cases be compensated completely and the time to solution 
experiences no delay. In other words, by using a computational superman to 
recover the lost state, we are able to fully compensate a fault. 

The rest of our paper is organized as follows: In Sec. we describe the 
model PDF and fault setting. The required data structure for our recovery 
strategies will be discussed in Sec. The redundancy in the ghost layer data 
structures allows to combine full multigrid efficiency with tearing and intercon¬ 
necting strategies. In Sec. we then develop local recovery strategies and study 
numerically their influence on the global convergence. The main algorithmic re¬ 
sult can be found in Sec. ??. Here Dirichlet-Neumann and Dirichlet-Dirichlet 
coupling strategies are combined with fast hierarchical multigrid methods. The 
effect of the size of the faulty domain is illustrated for large scale computa¬ 
tions. In Sec.[^ we test both new global recovery strategies on a state-of-the-art 
peta-scale system. To fully compensate for the fault with respect to the time 
to solution, we enhance the massively parallel geometric multigrid method by 
asynchronous components on partitioned domains. 


2 Model problem and fault setting 

In this section, we introduce the model PDF scenario, the notation for the geo¬ 
metrical multigrid solver and the fault model for which we design our recovery 
techniques. 
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2.1 Model problem and parallel multigrid solver 

For the sake of simplicity, we will illustrate the methods for the Laplace equation 
in 3D with inhomogeneous Dirichlet boundary conditions 

— Au = 0 in D, u = g on dfl (1) 

as PDF model problem. The generalization to more general boundary condi¬ 
tions, right hand sides or other scalar elliptic problems is straightforward. In 
Sec. 1^ we will also present a numerical example for the Stokes system. Here, 
D C is a bounded polyhedral domain which is triangulated with an un¬ 
structured base mesh T- 2 - In the following, we will assume for simplicity that 
tetrahedral elements are used, but again our techniques generalize readily to 
hexahedral and hybrid meshes. From this initial coarse mesh, a hierarchy of 
meshes T := {71, I = 0,..., L} is constructed by successive uniform refinement, 
see, e.g., [7j. This hierarchy is used to construct the geometric multigrid solver; 
the choice to start with I = 0 guarantees that there is at least one inner degree 
of freedom if the mesh on level I is restricted to one T G T- 2 - 

The discretization of Q uses conforming linear finite elements (FE) on 71 
that leads canonically to a nested sequence of finite element spaces Vb C Fi C 
... C Fl C and a corresponding family of linear systems 

Aiui=l-, l = 0,...,L, (2) 

where the Dirichlet boundary conditions are included. We apply multigrid cor¬ 
rection schemes in V-cycles with standard components to ([^. In particular, 
we use linear interpolation and its adjoint operator for the inter-grid transfer, 
a hybrid variant of a Gauss-Seidel updating scheme in three pre- and post¬ 
smoothing steps and a preconditioned conjugate gradient (PCG) method as 
coarse grid solver. For a general overview of multigrid methods, we refer to 
[13 [33]. In highly parallel multigrid frameworks for distributed memory archi¬ 
tectures 13 [3 [13 [33 EH, the work load for solving PDE problems is typically 
distributed to different processes based on a geometrical domain partitioning. 
Here, the base tetrahedral mesh T -2 defines the partitioning used for paralleliza¬ 
tion. Each tetrahedron T G T -2 is associated with a processor, and this also 
induces a process assignment for all refined meshes. In general, several of the 
coarse mesh tetrahedra would be assigned to each processor so that a good load 
balancing occurs. In our simple model case, all tetrahedra are equally refined 
and thus induce the same load, so that for the sake of simplicity, we assume 
that the number of processors is equal to the number of tetrahedra in T- 2 , i-e., 
we have a one-to-one mapping of base mesh tetrahedra to processors. 

2.2 Fault model and pollution effect 

For our study, we concentrate on a specific fault model under assumptions sim¬ 
ilar to (33 IM] • We restrict our study, for simplicity, to the case that only one 
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processor crashes. All strategies can be extended easily to a failure of more pro¬ 
cessors, since they only rely on the locality of the fault, see also Sec. [5] where 
a large scale simulation with two faults at different locations will be presented. 

The faulty subdomain ftp d is assumed to be a single open tetrahedron in 
T- 2 i and flj := fl\flp is called intact or healthy subdomain. Then, the intact 
and faulty regions are separated by an interface T := dflj n dflp. We denote 
the unknowns up, ui and up with respect to the subdomains ftp, flj and the 
interface T, respectively, cf. Fig. 


Faulty domain: up in ftp 
Interface: ur on F 
Intact domain: ui in fli 


Figure 1: Domain with a faulty subdomain flp, interface F and intact sub- 
domain flj. 

Moreover, we assume that a failure occurs only after a complete multigrid 
cycle and is reported immediately. The solution values in the faulty subdomain 
are set to zero as initial guess. After the local re-initialization of the problem 
and a possible recovery step, we continue with multigrid cycles in the solution 
process. We focus on three categories of computational jobs in the solution 
process: A job in which no fault occurs is called a fault-free job, a do-nothing 
job is a job in which a failure occurs but no special recovery strategy is applied 
(besides reinitializing the solution locally to zero), and a recovery ioh stands for 
a job where recovery strategies are applied after a fault happened. We assume 
a process experiences a fault after kp cycles and count the number of iterations 
necessary to reach the stopping criteria (here: 10“^®) in the case of a fault-free 
job fcfj-ee ^ faulty job with possible recovery fc£g^yj.(-y. To quantify the extra 
iterations compared to a fault-free job, we introduce a relative Cycle Advantage 
(CA) parameter k defined by 



^faulty ^free 

- - -. (3) 

Intuitively, we expect that k € [0,1]. The situation k = 0 represents the case 
when the recovery algorithm reaches the stopping criteria with the same number 
of iterations as in a fault-free job. For increasing k, more and more additional 
iterations are required to reach the stopping criteria. For k = 1, kp additional 
multigrid iterations have to be carried out, and this means that essentially all 
information that had been accumulated before the fault, has been lost. 

The following simple test setting illustrates the need for special recovery 
strategies. We consider a fault scenario with 16 million unknowns and a loss of 
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0.3 million unknowns in case of a process crash in the unit cube fl = (0,1)^. The 
faulty subdomain is located similar to Fig. On the left of Fig. we show the 
decay of the residual within a global parallel multigrid scheme. The fault occurs 
after kp = 5 multigrid iterations and in the case of a do-nothing job, the residual 
after the fault is highly increased resulting in four additional multigrid steps to 
obtain a given tolerance compared to the fault-free job. Further tests show that 
the number of additional steps is almost always equal to kp — 1. Thus a do- 
nothing run results in a k close to one. A favorable pre-asymptotically improved 
convergence rate after the fault often helps to save one cycle, but besides from 
this, the extra cost incurred by the fault is essentially the number of cycles that 
have been performed before the fault. The situation becomes even worse when 
multiple faults occur. 





Figure 2: Residual decay (left) and residual distribution after the failure (mid¬ 
dle) and after one additional global V-cycle (right) restricted to a cross section 
through the domain ft, {a := \ogiQ{\Residual\)). 


The middle illustration in Fig. visualizes the residual on a cross section 
through the domain directly after the failure and re-initialization. The resid¬ 
ual distribution when one additional global multigrid cycle has been performed 
is given on the right. Note that large local error components within Qp are 
dispersed globally by the multigrid cycle. Although the smoother transports in¬ 
formation only across a few neighboring mesh cells on each level, their combined 
contributions on all grid levels leads to a global pollution of the error. Though 
the residual decreases globally, the residual in the healthy domain flj increases 
by this pollution effect. A possible remedy is based on temporarily decoupling 
the domains in order to avoid that the locally large residuals can pollute into the 
healthy domain. This is motivated by the asynchronous multilevel algorithms in 
[45] and leads to strategies that combine domain decomposition [43l [52] and 

multigrid techniques. They are essentially based on hierarchical and partitioned 
data structures as will be discussed in the next section. 


3 Software requirement for algorithmic resilience 

In this section, we introduce a software architecture that is suitable to deal with 
faults and that supports numerical recovery procedures in the case of faults. 
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The main ingredient for the recovery algorithms is a hybrid data structure that 
allows to combine multigrid mesh hierarchies with tearing and interconnecting 
strategies from domain partitioning. All our numerical results will be carried 
out within the HHG software library, El [Ml 131], but the essential techniques 
can be adapted to other software concepts, since they essentially only exploit 
the redundancy of data in ghost layers as they are used also in other distributed 
memory parallel frameworks. 

3.1 Hybrid hierarchical grid data structures 

Often, the parallel communication of processes is organized in ghost layers 
(sometimes also called halos) which redundantly store copies of master data. 
This is a convenient technique to accommodate data dependencies across pro¬ 
cessor boundaries. Only the original master values can be written by the algo¬ 
rithm, and thus, once a master value has been changed, the associated ghost 
values must be updated to hold consistent values. We here propose a system¬ 
atic construction of the ghost layer data structures that is induced by the mesh 
geometry. 

The tetrahedra of the unstructured coarsest mesh 71 2 provide natural pro¬ 
cess boundaries for all rehned meshes. Note that the refined meshes will have 
nodes that are located on the vertices, edges, faces, and the interior of the initial 
elements in T- 2 - For two tetrahedra in 3D, this is illustrated in Fig. [^ below, 
but for the sake of simplicity we illustrate the concept for the 2D case in Fig. 
The upper left of Fig. [^illustrates how first all fine mesh nodes are classified. 
In the upper right illustration, then the ghost layers are introduced. This is 
achieved by separating the two connected triangular elements, and identifying 
the common edge as a separate entity. For the software architecture, this clas¬ 
sification of nodes induces a system of container data structures. In particular, 
there are 3D containers to hold the nodes in the interior of each G T- 2 , then 
2D containers for those nodes that lie on each coarse mesh face Fij = TiD Tj, 
ID containers for the nodes that lie on the coarse mesh edges, and eventually OD 
containers that hold trivially the nodes coinciding with the vertices of T- 2 - In 
a parallel setting, we will allocate all these container objects onto the different 
processors. 

Conceptually, the next step is to introduce the ghost layers. The geometrical 
classification above had distributed the master copies of each node into separate 
containers. As indicated in Fig. (top-right) (see also Fig. for 3D), these 
containers can be enriched by ghost nodes which are copies of master nodes that 
are stored elsewhere. Thus all the fine grid nodes that rest on the boundary of 
a T G T -2 become ghost nodes in that T, similarly, in the face data structure, 
the nodes that lie on the edges become ghost nodes, and eventually the end 
points of edges become ghost nodes for the edge data container. Furthermore, 
each of the face, edge, and node containers is enriched by one additional layer 
of ghost nodes that hold additional copies of master values. These extra ghost 
layers are essential for e.g. efficiently implementing a Gauss-Seidel iteration for 
the master nodes of the corresponding face container. 
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C Copying to update 
ghost points 


Process boundary 



Figure 3: Top left: Two structured refined elements including grouping of the 
finite element nodes. Top right: Four containers for edge and interior nodes 
and updates of their corresponding ghost points. Bottom: Duplication of an 
edge container for a parallel example and synchronization via MPI. Memory 
representation of the data structures and one stencil for two different containers. 
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For the parallel implementation, we eventually introduce additional copies 
of the face, edge, and vertex containers so that they can be uniquely allocated 
on the processors of a distributed memory system. In the HHG implementation, 
the communication between nodes is thus split on two subtasks. One consist in 
updating the ghost node values in the containers that are local to a processor 
by simple local copy operations. The second is the actual synchronization of the 
interface data structures (face, edge, node) by message passing in the distributed 
memory system. 

We remark that this construction obviously leads to a data redundancy 
and extra cost in data storage, but that for significantly refined meshes the 
additional memory cost is of lower order complexity. As we will see below, this 
redundancy enables efficient implementations of the parallel multilevel solver 
algorithms, including a systematic recovery after a fault, i.e., the numerical 
recovery of the data in these structures. 

Note also, that these containers can be implemented using simple array data 
structures and can be accessed with integer indices. This can be systemati¬ 
cally exploited to design a highly efficient implementation that avoids indirect 
addressing and that permits efficient vectorizable looping through the data, as 
indicated in Fig. (lower left). Further note, that the interface data structures 
can also be used to implement Dirichlet or Neumann boundary conditions for 
those tetrahedra that lie at the domain boundary, see also Subsec. |3.2| for the 
resulting linear algebraic structure. We point out that in case of a fault, the 
master nodes of a tetrahedral container may be lost and no copies of the master 
values will exist anywhere. However, clearly for all of the interface structures, 
the above construction will provide ghost copies, from which the data can be 
restored. 

In the following, we will design suitable recovery algorithms for this task. 
As a first step, we will study the resulting linear algebra structures. 

3.2 Equivalent algebraic formulations 

The data structure discussed in the previous subsection allows different equiv¬ 
alent algebraic formulations of ([^ that play an important role in the following 
sections. Fig. [^illustrates the ghost-layer structure of a coarse mesh face with 
the master and ghost nodes of a refined mesh. For the interface container 
structures (faces, edges, vertices), a recovery is directly possible, since we can 
assume that redundant copies exist in the system. However, for the volume el¬ 
ements (tetrahedra), their inner nodes are not available, and thus they must be 
re-constructed numerically. The ghost layer structure and the associated data 
redundancy can be represented algebraically by rewriting equivalently as 
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Figure 4: Two neighboring master elements (left) and ghost-layer structure of 
the common master face (right) 


The sub-matrices are associated with the block unknowns, in more general set¬ 
tings, they also depend on the basis functions and the PDF. Because of the 
locality of the support of the basis function, we can identify A^i with Aj-p^ 
and Atf with Aj.p^. Row 2 and row 4 in Q guarantee the consistency of the 
redundant data at the interface between the master elements. Row 3 reflects, 
the ghost layer structures associated with the master faces. We recall that by 
our assumptions the data Up and Up^ are lost but Uj and Up^ are still available. 
Although it is a priori not known what the intact domain will be, the data struc¬ 
ture allows always such a presentation. If a processor fails which is associated 
with master faces, then the recovery for Up can be trivially performed using the 
redundant information of row 2. Thus, we focus in the following only on the 
recovery of Up and Up^. 

The ghost layer structure does not only permit access to Apr but also to 
decompose it into Ar^r^. + More precisely for each vertex on a master 

face, the two sub-stencils associated with 12/ and flp are available, see also Fig. 

Having the sub-stencils at hand, we can rewrite ([^ equivalently in terms of 
the additional flux unknown Ap^. as 
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Here — Ap^ stands for the discrete flux out of fl/ and into fflp- It reflects a 
Neumann boundary condition for the intact, healthy domain. In Q and ([^, 
we thus find the typical algebraic structure of Dirichlet and Neumann subprob¬ 
lems associated with the faulty and the intact subdomain, respectively. This 
observation motivates the design of our parallel recovery algorithms. 
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Figure 5: Stencil (left) and sub-stencil (right) structure for a vertex on a master 
face 

4 Local recovery strategy 

So far we have only considered local recovery strategies in the faulty subdomain. 
During the recovery, the processes in the intact subdomain are halted until 
the recovery in the faulty subdomain has finished. However, for the overall 
efficiency we cannot neglect the time spent in the recovery process. Thus, we 
extend our approach into two directions. Firstly, we introduce a superman 
strategy to speed up the local recovery process itself, and secondly we propose 
a global recovery such that on the faulty and intact subdomains parallel but 
asynchronous processes take place. 

4.1 The local superman 

In parallel geometric multigrid, load balancing can be often carried out with 
respect to the degrees of freedom. Here the situation is quite different. The 
multigrid iteration in the faulty subdomain must deal with a much higher resid¬ 
ual compared to the residual on the healthy domain. To compensate for this, we 
use a local superman processor, i.e. we over-balance the compute power in the 
faulty subdomain. Technically, the additional compute resources for realizing 
the superman can be provided by additional parallelization. We propose here, 
that e.g. a full (shared memory) compute node is assigned to perform the local 
recovery for a domain that was previously handled by a single core. This can 
be accomplished by a tailored OpenMP parallelization for the recovery process. 
Alternatively, a further domain partitioning ot flp can be used together with 
an additional local distributed memory parallelization. Alternatively, one may 
think to exploit accelerators, such as GPUs or a Xeon Phi. 

On the other hand in practice, Hi? is much smaller than D/ and will typically 
fit in a single processor memory. Thus a suitably designed recovery process 
will automatically benefit from locality. In this case, the local process will be 
automatically faster than a global process of the same type involving full message 
passing communication over long distances. Moreover, in a standard multigrid 
implementation, the coarsest grid for D p contains only one inner node, and thus 
the coarse grid problem within the local multigrid cycle is trivial to solve and 
does not contribute to the time per multigrid cycle. This is in contrast to large 
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scale parallel computations, when the cost of the coarse grid solver can often not 
be neglected, since it may contribute 50% or more of the total time to solution. 

Since we here assume that local and global computation are performed si¬ 
multaneously, the time for the recovery is presented as the maximum of the time 
spent in Qj and in ftp. By introducing r^super as the acceleration ratio of the 
cycles that can be performed simultaneously in flj’, parallel to one iteration in 
17/, we obtain a quantitative measure for the expected speed up. This speedup 
can be caused by an increase of the compute power and/or an decrease of the 
communication overhead and a complexity decrease in the coarse grid solver. If 
■'Isuper = 1, there is no speedup, i.e., one global cycle can be performed as fast 
as a local one. For the ideal but practically not relevant situation 77super = oo, 
the local recovery of Sec. |^does not contribute to the global run time. 

4.2 Dirichlet-Dirichlet recovery strategy 

In the Dirichlet-Dirichlet recovery, we freeze the values Uti the interface F/. 
This allows to compute the two subproblems in D/? and flj independently and 
consequently no communication between flp and 17/ is necessary. Thus, it is 
guaranteed that no defect data is polluted into the healthy subdomain during the 
recovery. On both subdomains, we iterate now on decoupled Dirichlet problems 
with boundary data on F given by Up. Obviously, at some point we have to 
reconnect the two subdomains again. The algorithm for the Dirichlet-Dirichlet 
recovery is presented in Alg. 

Algorithm 1 Dirichlet-Dirichlet recovery (DD) algorithm 
1: Solve ([^ by multigrid cycles. 

2 : if Fault has occurred then 
3: STOP solving ([^. 

4: Recover boundary data Up^ from line 4 in Q 

5: Initialize Up with zero 

6: In parallel do: 

7: a) Use np MG cycles accelerated by r^super to approximate line 5 in Q; 

8: AppUp = f_p — ApppU-p^ 

9: b) Use ni MG cycles to approximate line 1 in 0 

10: AjiUj = /p — AjPjUp^ 

11: RETURN to line 1 with new values Uj in 17/ and Up in flp. 

12: end if 


4.3 Dirichlet-Neumann recovery strategy 

In the Dirichlet-Neumann recovery strategy, we do not freeze the interface values 
but treat them as Neumann boundary data in the healthy subdomain. By 
doing so, we use a one-directional coupling of the faulty subdomain 17 p and the 
healthy subdomain 17/. On 17/ we approximate a Neumann boundary problem 
with static data, whereas on Qp, we approximate a Dirichlet problem with 
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dynamic boundary data. After each multigrid cycle on flj, the newly computed 
interface values are asynchronously communicated via Up onto Ur^. Hence, 
we only avoid communication from the faulty to the healthy domain but still 
keep communicating from the intact to the faulty subdomain. As in the case 
of the Dirichlet-Dirichlet recovery strategy, it is necessary to fully interconnect 
both subdomains after a couple of cycles. The algorithm is presented in Alg. 


9 

10 

11 

12 

13 

14: 

15 

16 
17 


Solve by multigrid cycles, 
if Fault has occurred then 
STOP solving ([^. 

Compute the Neumann condition from line 2 in ([^: 

Ap^ = - f^TjTjUrj 

Initialize Up with zero. 

In parallel do: 

a) Perform np cycles with acceleration lysuper^ 

Recover/update boundary values from line 5 in ([^. 

Use 1 MG cycle to approximate line 6 in ([^: 

Appup = /^ — ApppUp^ 

b) Perform nj cycles: 

Use 1 MG cycle to approximate lines 1+2 in (§ 

(An Ajp\(up\^ (f_^\ 

V"^PU Ap^pJ \upj \^r,J 

Update boundary values Up from line 3 in (§. 

RETURN to line 1 with values Uj in U/, Up on P and Up in Qp. 
end if 


4.4 Comparison of the recovery strategies 

In this subsection, we compare the Dirichlet-Dirichlet (DD) and Dirichlet-Neumann 
(DN) recovery with the local recovery strategy (LR), where calculations are only 
performed in the faulty subdomain while the healthy domain stays idle. 

From the left to the right in Fig. the three different fault geometries and 
macro-meshes are shown which is uniformly refined seven times for the following 
computations. Scenario (I) has 2 millions unknowns and the fault (red marked 
tetrahedron) is located with two faces at the boundary of the computational 
domain and affects about 16.7% of the total unknowns. Scenario (II) has 16 
millions unknowns, a corruption by the fault of 2.0% (like in Subsec. ??), and 
the faulty domain has one edge coinciding with the boundary. Scenario (III) 
has 56 millions unknowns, the faulty domain is floating in the center of the fl 
and affects only 0.6% of the unknowns. In Tab. and Tab. we present 
the values for the cycle advantage k with respect to the three different fault 
scenarios. We compare these strategies for two different superman speed up 
factors ??super = 2 and rjsuper = 5. In addition, we also illustrate the influence of 
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Figure 6: From left to right: Scenario (I), Scenario (II) and Scenario (III) 


n/ in the recovery strategy before interconnecting the faulty and intact domain. 
Please note that the choice nj = 0 reflects a do-nothing job. We always fix 
np = \ni * r^superl and run the numerical tests for ni G {0,1, 2,3,4}, i.e., in 
the case nj = 3 and jysuper = 2, we have np = 6. To evaluate k, we set 
^faulty to the sum of the required global multigrid cycles and n/. We point out 
that in Sec. we had jysuper = 1, be., n/ = np, and we did not add n/ to 
the global multigrid cycle to get fcfauity From the results of Sec. it is now 
obvious that the local recovery strategy without ?7super > 1 is of no interest. In 


Tab. I: Comparison of the Cycle Advantage k for the local, Dirichlet-Dirichlet 
and Dirichlet-Neumann recovery strategy for a fault after 5 iterations. 


Superman speedup 77super = 2 



Scenario (I) 

Scenario (II) 

Scenario (III) 

nj 

LR 

DD 

DN 

LR 

DD 

DN 

LR 

DD 

DN 

0 

0.60 

0.60 

0.60 

0.80 

0.80 

0.80 

0.80 

0.80 

0.80 

1 

0.20 

0.00 

0.00 

0.40 

0.40 

0.40 

0.20 

0.20 

0.00 

2 

0.40 

0.20 

0.00 

0.40 

0.40 

0.20 

0.40 

0.00 

0.00 

3 

0.60 

0.40 

0.20 

0.60 

0.60 

0.40 

0.60 

0.20 

0.00 

4 

0.80 

0.60 

0.40 

0.80 

0.80 

0.60 

0.80 

0.40 

0.20 


Superman speedup 77super = 5 


71 / 

LR 

DD 

DN 

LR 

DD 

DN 

LR 

DD 

DN 

0 

0.60 

0.60 

0.60 

0.80 

0.80 

0.80 

0.80 

0.80 

0.80 

1 

0.20 

0.00 

0.00 

0.20 

0.20 

0.20 

0.20 

0.00 

0.00 

2 

0.40 

0.20 

0.00 

0.40 

0.40 

0.20 

0.40 

0.00 

0.00 

3 

0.60 

0.40 

0.20 

0.60 

0.60 

0.40 

0.60 

0.20 

0.00 

4 

0.80 

0.60 

0.40 

0.80 

0.80 

0.60 

0.80 

0.40 

0.20 


Tab. the results are given for a fault after 5 global multigrid iterations. All 
recovery strategies can drastically reduce the delay compared to a do-nothing 
strategy since they can attain a considerably smaller cycle advantage k. For 
Scenario (I) and Scenario (III), we can even achieve At = 0. The Scenario (III) 
is more relevant for applications on mid-size clusters, whereas Scenario (I) is 
more relevant for small desktop systems. Relevant cases for peta-scale systems 
will be considered in the next section. We observe a clear advantage in using 
a global recovery strategy instead of a local recovery strategy. In the case of 
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the (LR) strategy, most of the machine stays idle for n/ multigrid cycles. Only 
in the case of the two global recovery strategies, we fully exploit the compute 
power of the total system. The Dirichlet-Neumann recovery strategy is the most 
powerful and robust one with respect to the choice of nj. In all cases, nj plays 
a crucial role in the overall performance. If nj is too large, we are over-solving 
the subdomain problems, if nj is too small, then the approximation on the 
faulty subdomain is too inaccurate and the global performance suffers from the 
pollution of the local defect into the global subdomain 


Tab. 2: Comparison of the Cycle Advantage k for the local, Dirichlet-Dirichlet 
and Dirichlet-Neumann recovery strategy for a fault after 11 iterations. 


Superman speedup 77super = 2 


nj 

Scenario (I) 

LR DD DN 

Scenario (II) 

LR DD DN 

Scenario (III) 

LR DD DN 

0 

0.82 

0.82 

0.82 

0.91 

0.91 

0.91 

0.91 

0.91 

0.91 

1 

0.64 

0.64 

0.63 

0.82 

0.82 

0.82 

0.64 

0.64 

0.64 

2 

0.55 

0.55 

0.55 

0.55 

0.55 

0.55 

0.45 

0.45 

0.45 

3 

0.36 

0.36 

0.36 

0.45 

0.45 

0.45 

0.36 

0.36 

0.36 

4 

0.36 

0.27 

0.27 

0.36 

0.36 

0.36 

0.36 

0.27 

0.27 

5 

0.45 

0.36 

0.27 

0.45 

0.45 

0.36 

0.45 

0.27 

0.27 

6 

0.64 

0.45 

0.36 

0.55 

0.55 

0.45 

0.45 

0.36 

0.36 


Superman speedup 77super = 5 


ni 

LR 

DD 

DN 

LR 

DD 

DN 

LR 

DD 

DN 

0 

0.82 

0.82 

0.82 

0.91 

0.91 

0.91 

0.91 

0.91 

0.91 

1 

0.36 

0.36 

0.36 

0.36 

0.36 

0.36 

0.27 

0.27 

0.27 

2 

0.18 

0.09 

0.00 

0.18 

0.18 

0.09 

0.18 

0.00 

0.00 

3 

0.27 

0.18 

0.09 

0.27 

0.27 

0.18 

0.27 

0.09 

0.09 

4 

0.36 

0.27 

0.18 

0.36 

0.36 

0.27 

0.36 

0.18 

0.18 


To obtain a better feeling for the optimal np depending on kp, we consider in 
Tab.j^the situation of a fault after 11 multigrid cycles. In contrast to Tab.[^ we 
must select nj larger to get the best cycle advantage k. This results from the fact 
that now the local subproblem solver on the faulty domain must counterbalance 
the smaller residual on the healthy domain and achieve a higher accuracy. 

The best choice for all strategies seems to have a np close to kp, e.g., for a 
fault after kp = 11 iterations and rysuper = 5 in Tab. the best choice is n/ = 2 
and np = 10. The later the fault occurs, the more powerful the recovery strategy 
has to be. In particular without a significantly increased compute power per 
degree of freedom in the faulty domain compared to the healthy domain, we 
cannot fully compensate for the fault. 


5 Parallel recovery 

Now we investigate both global recovery strategies in a parallel setting on a 
state-of-the-art peta-scale system. Our test system is JUQUEEN, an IBM Blue 
Gene/Q system with a peak performance of more than 5.9 petaflop/s. Each of 
















Resilience for Multigrid Software 


16 


the 28 672 nodes is equipped with 16 cores. Each core can execute up to four 
hardware threads to help hiding latencies. A five-dimensional torus network 
results in short communication paths within the system. The HHG software 
used in this article is compiled by the IBM XL C/C-l—I- compiler V12.1 using 
flags -03 -qstrict -qarch=qp -qtune=qp) and is linked to MPICH2 version 1.5. 
Technically, we realize the superman strategy by a logical splitting of the faulty 
tetrahedron and by employing additional MPI processes to perform the recovery. 
The software was compiled by the IBM XL C/C-I--I- compiler VI2.I (using flags 
-03 -qstrict -qarch=qp -qtune=qp) and linked to MPICH2 version 1.5. 

Tablej^presents weak scaling results with 6x (2^-|-l)^ coarse mesh tetrahedra, 
I G {1, 2,3,4}. After refinement to full resolution, each coarse mesh tetrahedron 
contains about 2.8 • 10® grid points and is assigned to one hardware thread of 
JUQUEEN. Hence, the ratio between faulty and healthy domain decreases from 
0.6% to 3.4 • 10“® % during the weak scaling sequence presented here. The first 
column displays the total number of unknowns in the system. In the largest 
computations, we solve for more than 8.2 • 10^® degrees of freedom on 14 743 
compute cores. The numbers in the rest of the table represent the additional 
time-to-solution compared to a fault-free run. A negative number here means 
that due to the superman strategy, the computation with recovery terminates 
with the stopping criterion being satisfied faster than a fault-free run. As we 
can see from Tables this surprising effect is possible due to the fact that 
the temporary decoupling of the subdomains reduces the communication, but 
clearly this kind of saving is not very significant. 

In the second column, we display the additional time for the do-nothing job. 
The retardation of the time-to-solution ranges from 27.5% to 34.6%. For exam¬ 
ple, for a large problem with 4 353® unknowns, the time-to-solution increases 
from 39.60s to 50.49s. To analyze the global recovery strategies, we must real¬ 
ize supermen with a higher compute power than a normal processor. Here we 
implement supermen with rysuper G {1,2,4,8} by replacing the single faulty pro¬ 
cessor by 77super Spare processors. As in the previous section, the DN strategy 
is more robust with respect to nj compared to the DD strategy. However, a su¬ 
perman power of Tysuper = 8 does not improve the results significantly compared 
to jysuper = 4. Already with rysuper = 4, we can in most cases fully compensate 
the fault and achieve the required accuracy with no increased time-to-solution. 
Setting n/ = 1, is not a good choice since it requires a large r/super to compen¬ 
sate for accuracy loss in the faulty domain. The later the fault occurs the more 
it is important to set nj and thus rip large enough, i.e., re-couple not too early, 
and to have a strong enough superman. We recall that np = rjsuperni. The DN 
recovery strategy in combination with rysuper = 4, n/ = 2 or n/ = 3 yields the 
optimal choice with respect to time-to-solution and cost efficiency. 

At this stage, we point out that the cost of the superman becomes insignif¬ 
icant in a large scale computation, even if 2, 4, or 8 spare processors must be 
kept idle until they are called for help. This is simply an effect of scale. When 
we use 14 743 processors, even providing 12 cores for three supermen of strength 
?7super = 4 (which would be able to compensate for the unrealistic scenario of 
three faults in a single solve) constitutes only an extra cost of less than 0.1%. 
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Finally, we test in Tab. [^a scenario where two faults at different locations 
occur at different times. The first processor crash happens after 5 multigrid 
iterations and the second one after 9. Since we select n/ < 4, the recovery of the 
first fault has finished when the second one occurs. A failure-prone cluster with 
many consecutive faults does not provide a suitable environment for trustable 
high-performance computations such that a certain interval between the crashes 
is reasonable. For 4 353^ unknowns, the time-to-solution increases form 39.60s 
to 59.34s and the additional consumed computation time for the scaling ranges 
due to the second fault now from 47.8% to 55.7% of the original time where no 
fault has occurred. 

We specify Tysuper = 4 and study the performance with respect to nj. For 
71/ = 1, we obtain already a significant improvement compared to the do-nothing 
job. Nevertheless with nj = 2 or nj = 3, we observe much better results. As a 
rule of thumb we can set for moderate values of ri/, nj such that np = rjsnperni « 
nj + kp — 2. This rule of thumb is based on the observation that we need roughly 
kp — 2 iteration to compensate for the fault in the local recovery and on the 
assumption that our global accuracy is still governed by the one in the faulty 
domain and does not suffer from the reduced communication. If this rule of 
thumb would require a value larger than three or four for n/, then one should 
increase the superman power. If nj is too large then too many steps without 
information exchange at the interface between healthy and faulty subdomain 
are carried out and the multigrid acts as direct solver on the subdomains with 
inexact boundary data. Thus short time-to-solutions require a careful balancing 
of volume and surface components, and a nj and np not too small but also not 
too large. 

6 Conclusion and Outlook 

This paper presents first insight in constructing a fault tolerant parallel multi¬ 
grid solver. Hard faults result in a loss of dynamical data in a subdomain. 
Geometric multigrid solvers are inherently well suited to compensate such a 
loss of process state and to reconstruct the data based on a redundant storage 
scheme for numerical values only along the subdomain interfaces and using ef¬ 
ficient recovery algorithms for the bulk of the data. To recover lost numerical 
values, local reconstruction subproblems with Dirichlet boundary conditions are 
solved approximately. 

It is found that approximate solvers using a single fine grid, such as relaxation 
schemes or Krylov space methods, as well as conventional domain decomposi¬ 
tion techniques with direct subdomain solvers are much too inefficient and are 
thus unsuitable to serve as a basis for practical fault recovery techniques. In 
contrast, local multigrid cycles can be used to recompute the lost data with the 
least numerical effort. This strategy becomes efficient in both cost and time-to- 
solution when the local multigrid recovery cycles are accelerated by a superman 
strategy, as it can be realized by an excess parallelization. 

Further, we investigate methods that combine local and global processing 
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in an asynchronous way. A Dirichlet-Dirichlet recovery or Dirichlet-Neumann 
recovery strategy can be used, which both iterate in the faulty subdomain and 
the healthy subdomain independently, until the local solution has been recovered 
sufficiently well. Only then the subdomains are reconnected to continue the 
regular multigrid solution process. Both strategies can further reduce the delay 
in case of a fault. For different fault scenarios, we observe that the Dirichlet- 
Neumann strategy should be preferred. Combined with the superman compute 
power on the small faulty subdomain, the global recovery techniques can result 
if a full numerical compensation of the fault while costing no additional compute 
time. The robustness and flexibility of the designed algorithms are tested on a 
state-of-the art peta-scale system including large scale simulation with close to 
10^^ unknowns. We thus believe that the superman strategy will be a viable 
and resource efficient approach to achieve ABFT on future exa-scale systems 
which may have many millions of cores. 
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Tab. 3: Additional time spans (in seconds) of different global recovery strategies 
during weak scaling with a fault after kp = 7 iterations. 


Size 

No Rec 

DD Strategy nj 

= 1 


DN Strategy nj 

= 1 


^super — 1 

2 

4 

8 

^super — 1 

2 

4 

8 

769^ 

13.00 

12.97 

10.30 

2.66 

-0.02 

12.98 

10.30 

2.66 

0.02 

1281^ 

13.05 

12.79 

10.37 

2.52 

0.28 

12.78 

10.40 

2.53 

0.23 

2 305^ 

13.25 

10.34 

7.86 

2.98 

0.08 

10.66 

7.91 

2.98 

0.08 

4 353^ 

10.89 

10.81 

5.50 - 

0.02 

0.15 

10.57 

5.29 - 

■0.21 

-0.83 




DD Strategy nj 

= 2 


DN Strategy nj = 2 


Size 

No Rec 

^super — 1 

2 

4 

8 

^super — 1 

2 

4 

8 

769^ 

13.00 

12.72 

5.03 - 

■0.10 

-0.24 

12.74 

5.05 

-0.09 

-0.24 

1281^ 

13.05 

12.52 

5.00 - 

0.30 

-0.03 

12.53 

5.01 

-0.28 

-0.03 

2 305^ 

13.25 

10.05 

4.98 

0.07 

-0.22 

10.11 

5.03 

0.07 

-0.22 

4 353^ 

10.89 

7.81 

2.52 

2.28 

2.49 

7.63 

2.33 

-0.54 

-0.35 



DD Strategy nj 

= 3 


DN Strategy n/ = 3 


Size 

No Rec 

^super — 1 

2 

4 

8 

^super — 1 

2 

4 

8 

769^ 

13.00 

9.95 

2.28 - 

0.36 

-0.48 

9.99 

2.31 

-0.32 

-0.47 

1281^ 

13.05 

9.26 

2.17 - 

■0.56 

-0.31 

9.70 

2.15 

-0.54 

-0.32 

2 305^ 

13.25 

9.78 

2.15 - 

■0.21 

-0.52 

9.81 

2.15 

-0.21 

-0.51 

4 353^ 

10.89 

7.53 

2.21 

1.99 

2.16 

7.27 

-0.65 

-0.85 

-0.68 




DD Strategy nj 

= 4 


DN Strategy n/ 

= 4 


Size 

No Rec 

^super — 1 

2 

4 

8 

"^super — 1 

2 

4 

8 

769^ 

13.00 

9.71 

2.02 

1.92 

1.78 

9.73 

2.04 

1.96 

1.80 

1281^ 

13.05 

9.43 

1.88 

1.71 

1.80 

9.43 

-0.68 - 

■0.82 

-0.59 

2 305^ 

13.25 

9.54 

-0.75 - 

■0.53 

2.56 

9.53 

-0.71 - 

■0.51 

-0.79 

4 353^ 

10.89 

6.92 

4.32 

4.07 

3.91 

7.15 

1.67 

1.47 

1.65 


Tab. 4: Additional time spans (in seconds) of the Dirichlet-Neumann recovery 
strategies during weak scaling with a fault after kp = 5 and kp = 9 iterations. 


Size 


DN Strategy ??super = 4 

No Rec 

m = 1 

2 

3 

4 

769® 

19.21 

8.01 

0.05 

-0.38 

4.22 

1281® 

21.27 

10.58 

-0.15 

-0.68 

3.95 

2 305® 

18.50 

7.91 

-0.33 

-0.87 

3.76 

4 353® 

19.74 

5.81 

2.58 

4.61 

9.24 
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